29 research outputs found

    Privacy and Transparency in Graph Machine Learning: A Unified Perspective

    Graph Machine Learning (GraphML), which generalizes classical machine learning to irregular graph domains, has enjoyed a recent renaissance, leading to a dizzying array of models and applications across several domains. With its growing use in sensitive domains, and with government regulations calling for trustworthy AI systems, researchers have started to examine the transparency and privacy of graph learning. However, these topics have mainly been investigated independently. In this position paper, we provide a unified perspective on the interplay of privacy and transparency in GraphML.

    User Fairness in Recommender Systems

    Recent work on recommender systems has focused on diversity as an important aspect of recommendation quality. In this work we argue that post-processing algorithms aimed solely at improving diversity among recommendations lead to discrimination among users. We introduce the notion of user fairness, which has been overlooked in the literature so far, and propose measures to quantify it. Our experiments on two diversification algorithms show that an increase in aggregate diversity results in increased disparity among users.
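
    As a rough illustration of how such user-level disparity could be quantified (an assumed metric for illustration only, not necessarily the measure proposed in the paper), the sketch below compares each user's recommendation relevance before and after a crude diversification step on synthetic data and summarizes the spread across users with a Gini coefficient.

```python
import numpy as np

rng = np.random.default_rng(0)

def gini(values):
    """Gini coefficient of a non-negative 1-D array (0 = perfect equality)."""
    v = np.sort(np.asarray(values, dtype=float))
    n = v.size
    if n == 0 or v.sum() == 0:
        return 0.0
    cum = np.cumsum(v)
    # Trapezoidal Lorenz-curve formula.
    return (n + 1 - 2 * (cum / cum[-1]).sum()) / n

# Synthetic setup: 100 users, 500 items, top-10 lists, hypothetical relevance scores.
n_users, n_items, top_k = 100, 500, 10
relevance = rng.random((n_users, n_items))

# Accuracy-only lists: each user's top-k most relevant items.
acc_lists = np.argsort(-relevance, axis=1)[:, :top_k]
# Crude stand-in for a diversification step: swap half of each list for random items.
div_lists = acc_lists.copy()
div_lists[:, top_k // 2:] = rng.integers(0, n_items, (n_users, top_k - top_k // 2))

def mean_relevance(lists):
    """Mean relevance of the items shown to each user."""
    return relevance[np.arange(n_users)[:, None], lists].mean(axis=1)

loss_per_user = mean_relevance(acc_lists) - mean_relevance(div_lists)
print("aggregate relevance loss:", loss_per_user.mean())
print("disparity across users (Gini):", gini(np.clip(loss_per_user, 0, None)))
```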

    Boilerplate Removal using a Neural Sequence Labeling Model

    The extraction of main content from web pages is an important task for numerous applications, ranging from usability features, such as reader views for news articles in web browsers, to information retrieval and natural language processing. Existing approaches fall short because they rely on large numbers of hand-crafted features for classification. This results in models that are tailored to a specific distribution of web pages, e.g. from a certain time frame, but lack generalization power. We propose a neural sequence labeling model that does not rely on any hand-crafted features but takes only the HTML tags and words that appear in a web page as input. This allows us to present a browser extension that highlights the content of arbitrary web pages directly within the browser using our model. In addition, we create a new, more current dataset to show that our model is able to adapt to changes in the structure of web pages and outperform the state-of-the-art model. Comment: WWW20 demo paper.
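
    To make the input representation concrete, the sketch below shows one plausible way to flatten a page into a sequence of HTML-tag and word tokens and feed it to a small bidirectional LSTM token classifier. It is a minimal sketch under assumed design choices; the tokenization scheme, vocabulary handling, and model sizes are illustrative and not taken from the paper.

```python
from html.parser import HTMLParser
import torch
import torch.nn as nn

class TokenExtractor(HTMLParser):
    """Flatten a page into a single sequence of tag tokens and word tokens."""
    def __init__(self):
        super().__init__()
        self.tokens = []
    def handle_starttag(self, tag, attrs):
        self.tokens.append(f"<{tag}>")
    def handle_endtag(self, tag):
        self.tokens.append(f"</{tag}>")
    def handle_data(self, data):
        self.tokens.extend(data.split())

class BiLSTMTagger(nn.Module):
    """Per-token binary classifier: 1 = main content, 0 = boilerplate."""
    def __init__(self, vocab_size, emb_dim=64, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, 2)
    def forward(self, token_ids):                 # (batch, seq_len)
        h, _ = self.lstm(self.emb(token_ids))     # (batch, seq_len, 2*hidden)
        return self.out(h)                        # per-token logits

# Toy usage on a tiny page.
page = "<html><body><div>menu item</div><p>The actual article text.</p></body></html>"
parser = TokenExtractor()
parser.feed(page)
vocab = {tok: i for i, tok in enumerate(dict.fromkeys(parser.tokens))}
ids = torch.tensor([[vocab[t] for t in parser.tokens]])
logits = BiLSTMTagger(vocab_size=len(vocab))(ids)
print(parser.tokens)
print(logits.argmax(-1))   # untrained, so the predicted labels are arbitrary here
```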

    The Multiple-orientability Thresholds for Random Hypergraphs

    A $k$-uniform hypergraph $H = (V, E)$ is called $\ell$-orientable if there is an assignment of each edge $e \in E$ to one of its vertices $v \in e$ such that no vertex is assigned more than $\ell$ edges. Let $H_{n,m,k}$ be a hypergraph drawn uniformly at random from the set of all $k$-uniform hypergraphs with $n$ vertices and $m$ edges. In this paper we establish the threshold for the $\ell$-orientability of $H_{n,m,k}$ for all $k \ge 3$ and $\ell \ge 2$, i.e., we determine a critical quantity $c_{k,\ell}^*$ such that with probability $1 - o(1)$ the hypergraph $H_{n,cn,k}$ has an $\ell$-orientation if $c < c_{k,\ell}^*$, but fails to have one if $c > c_{k,\ell}^*$. Our result has various applications including sharp load thresholds for cuckoo hashing, load balancing with guaranteed maximum load, and massive parallel access to hard disk arrays. Comment: An extended abstract appeared in the proceedings of SODA 2011.
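
    The statement can be explored empirically. The following simulation (illustrative only, not taken from the paper) samples a random $k$-uniform hypergraph and decides $\ell$-orientability exactly via a maximum-flow computation with networkx; hyperedges are drawn independently, so occasional repeats are possible, and the values of $c$ scanned below are arbitrary.

```python
import random
import networkx as nx

def is_l_orientable(n, m, k, l, rng):
    """Sample a random k-uniform hypergraph with n vertices and m edges, then
    decide l-orientability by max-flow: source -> each edge (cap 1), edge -> its
    k vertices (cap 1), vertex -> sink (cap l). An l-orientation exists iff the
    maximum flow saturates all m edge arcs."""
    G = nx.DiGraph()
    for j in range(m):
        verts = rng.sample(range(n), k)          # k distinct vertices per hyperedge
        G.add_edge("s", ("e", j), capacity=1)
        for v in verts:
            G.add_edge(("e", j), ("v", v), capacity=1)
    for v in range(n):
        G.add_edge(("v", v), "t", capacity=l)
    flow_value, _ = nx.maximum_flow(G, "s", "t")
    return flow_value == m

# Rough empirical look at the threshold for k = 3, l = 2 (small n, few trials).
rng = random.Random(0)
n, k, l, trials = 300, 3, 2, 5
for c in [1.5, 1.7, 1.9, 2.1]:
    m = int(c * n)
    hits = sum(is_l_orientable(n, m, k, l, rng) for _ in range(trials))
    print(f"c = {c:.1f}: {hits}/{trials} samples were {l}-orientable")
```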

    Joint learning from multiple information sources for biological problems

    Thanks to technological advancements, more and more biological data have been generated in recent years. This data availability offers unprecedented opportunities to look at the same problem from multiple angles. It also unveils a more global view of the problem that takes into account the intricate interplay between the involved molecules and entities. Nevertheless, biological datasets are biased, limited in quantity, and contain many false-positive samples. Such challenges often drastically degrade the performance of a predictive model on unseen data and thus limit its applicability in real biological studies. Human learning is a multi-stage process in which we usually start with simple things. Through the knowledge accumulated over time, our cognitive ability extends to more complex concepts. Children learn to speak simple words before being able to formulate sentences; similarly, being able to speak correct sentences supports our learning to speak correct and meaningful paragraphs, and so on. Generally, knowledge acquired from related learning tasks helps boost our learning capability in the current task. Motivated by this phenomenon, in this thesis we study supervised machine learning models for bioinformatics problems that can improve their performance by exploiting multiple related knowledge sources. More specifically, we are concerned with ways to enrich the supervised models’ knowledge base with publicly available related data in order to enhance their prediction performance. Our work shares commonality with existing work in multimodal learning, multi-task learning, and transfer learning, though there are certain differences in some cases. Besides the proposed architectures, we present large-scale experimental setups with consensus evaluation metrics, along with the creation and release of large datasets, to showcase our approaches’ superiority. Moreover, we add case studies with detailed analyses in which we make no simplifying assumptions, to demonstrate the systems’ utility in realistic application scenarios. Finally, we develop and make available an easy-to-use website through which non-expert users can query the model’s predictions, facilitating assessment and adoption by field experts. We believe that our work serves as one of the first steps in bridging the gap between computer science and biology, opening a new era of fruitful collaboration between computer scientists and biological field experts.
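
    As a generic illustration of the idea of enriching a supervised model with a related knowledge source (a minimal sketch, not an architecture from the thesis), the code below trains a small network whose encoder is shared between a main prediction task and an auxiliary task drawn from a larger related dataset; all names, dimensions, and the synthetic data are assumptions.

```python
import torch
import torch.nn as nn

class SharedEncoderMultiTask(nn.Module):
    """One encoder shared by a main head and an auxiliary head."""
    def __init__(self, in_dim=100, hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.main_head = nn.Linear(hidden, 2)   # e.g. interaction vs. no interaction
        self.aux_head = nn.Linear(hidden, 5)    # e.g. a related annotation task
    def forward(self, x):
        h = self.encoder(x)
        return self.main_head(h), self.aux_head(h)

# Synthetic stand-ins for a small main dataset and a larger related public dataset.
torch.manual_seed(0)
x_main, y_main = torch.randn(256, 100), torch.randint(0, 2, (256,))
x_aux, y_aux = torch.randn(1024, 100), torch.randint(0, 5, (1024,))

model = SharedEncoderMultiTask()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(200):
    opt.zero_grad()
    main_logits, _ = model(x_main)
    _, aux_logits = model(x_aux)
    # Joint loss: the auxiliary task regularizes the shared encoder.
    loss = loss_fn(main_logits, y_main) + 0.3 * loss_fn(aux_logits, y_aux)
    loss.backward()
    opt.step()
print("final joint loss:", loss.item())
```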

    Multiple choice allocations with small maximum loads

    The idea of using multiple choices to improve allocation schemes is now well understood and is often illustrated by the following example. Suppose $n$ balls are allocated to $n$ bins, with each ball choosing a bin independently and uniformly at random. The \emph{maximum load}, i.e., the number of balls in the most loaded bin, will then be approximately $\frac{\log n}{\log \log n}$ with high probability. Suppose now the balls are allocated sequentially by placing each ball in the least loaded of $k \ge 2$ bins chosen independently and uniformly at random. Azar, Broder, Karlin, and Upfal showed that in this scenario the maximum load drops to $\frac{\log \log n}{\log k} + \Theta(1)$ with high probability, which is an exponential improvement over the previous case. In this thesis we investigate multiple-choice allocations from a slightly different perspective. Instead of minimizing the maximum load, we fix the bin capacities and focus on maximizing the number of balls that can be allocated without overloading any bin. In the process that we consider we have $m = \lfloor cn \rfloor$ balls and $n$ bins. Each ball chooses $k$ bins independently and uniformly at random. \emph{Is it possible to assign each ball to one of its choices such that no bin receives more than $\ell$ balls?} For all $k \ge 3$ and $\ell \ge 2$ we give a critical value $c_{k,\ell}^*$ such that when $c < c_{k,\ell}^*$ such an assignment exists with high probability, and when $c > c_{k,\ell}^*$ it does not. In case such an allocation exists, \emph{how quickly can we find it?} Previous work on the total allocation time for the case $k \ge 3$ and $\ell = 1$ analyzed a \emph{breadth-first strategy}, which is shown to be linear only in expectation. We give a simple and efficient algorithm, which we call \emph{local search allocation} (LSA), to find an allocation for all $k \ge 3$ and $\ell = 1$. Provided the number of balls is below (but arbitrarily close to) the theoretically achievable load threshold, we give a \emph{linear} bound for the total allocation time that holds with high probability. We demonstrate, through simulations, an order-of-magnitude improvement in total and maximum allocation times compared to the state-of-the-art method. Our results find applications in many areas including hashing, load balancing, data management, orientability of random hypergraphs, and maximum matchings in a special class of bipartite graphs.
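
    For intuition about the underlying allocation process in the simplest case (capacity $\ell = 1$), here is a small simulation that uses a standard random-walk insertion heuristic; this is not the thesis's LSA algorithm, merely a commonly used stand-in to show the balls-into-bins setup, and the parameters are arbitrary.

```python
import random

def random_walk_allocation(n_bins, n_balls, k, max_kicks=2000, seed=0):
    """Place each ball in one of its k random bins (capacity 1 per bin),
    evicting and relocating occupants on collisions (random-walk insertion).
    Returns the assignment, or None if some ball could not be placed."""
    rng = random.Random(seed)
    choices = [rng.sample(range(n_bins), k) for _ in range(n_balls)]
    bin_of = [None] * n_balls          # bin assigned to each ball
    ball_in = [None] * n_bins          # ball currently occupying each bin
    for ball in range(n_balls):
        current, kicks = ball, 0
        while kicks <= max_kicks:
            free = [b for b in choices[current] if ball_in[b] is None]
            if free:
                b = rng.choice(free)
                bin_of[current], ball_in[b] = b, current
                break
            # All choices occupied: evict a random occupant and relocate it.
            b = rng.choice(choices[current])
            evicted = ball_in[b]
            ball_in[b], bin_of[current] = current, b
            bin_of[evicted] = None
            current, kicks = evicted, kicks + 1
        else:
            return None                # gave up: likely above the load threshold
    return bin_of

# Example: k = 3 choices, load c = 0.85 (below the known k = 3, l = 1 threshold of about 0.918).
n = 10000
assignment = random_walk_allocation(n_bins=n, n_balls=int(0.85 * n), k=3)
print("allocation found" if assignment else "allocation failed")
```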

    The Multiple-Orientability Thresholds for Random Hypergraphs

    A $k$-uniform hypergraph $H = (V, E)$ is called $\ell$-orientable if there is an assignment of each edge $e \in E$ to one of its vertices $v \in e$ such that no vertex is assigned more than $\ell$ edges. Let $H_{n,m,k}$ be a hypergraph drawn uniformly at random from the set of all $k$-uniform hypergraphs with $n$ vertices and $m$ edges. In this paper we establish the threshold for the $\ell$-orientability of $H_{n,m,k}$ for all $k \ge 3$ and $\ell \ge 2$, that is, we determine a critical quantity $c_{k,\ell}^*$ such that with probability $1 - o(1)$ the hypergraph $H_{n,cn,k}$ has an $\ell$-orientation if $c < c_{k,\ell}^*$, but fails to have one if $c > c_{k,\ell}^*$. Our result has various applications, including sharp load thresholds for cuckoo hashing, load balancing with guaranteed maximum load, and massive parallel access to hard disk arrays.
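
    For readers less familiar with orientability, the following Hall-type reformulation (a standard max-flow argument stated here for context, not a result quoted from the paper) makes precise when an $\ell$-orientation exists.

```latex
% l-orientability as a bipartite b-matching feasibility condition.
% Edges demand one unit each; vertices have capacity l. By max-flow/min-cut
% (equivalently, a deficiency version of Hall's theorem) one obtains:
\[
  H = (V, E) \text{ is } \ell\text{-orientable}
  \iff
  \Bigl|\, \bigcup_{e \in E'} e \,\Bigr| \;\ge\; \frac{|E'|}{\ell}
  \quad \text{for every } E' \subseteq E .
\]
% In particular, H_{n,cn,k} fails to be l-orientable as soon as it contains a
% sub-hypergraph with more than l times as many edges as vertices.
```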